Nature Genetics
○ Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match Nature Genetics's content profile, based on 240 papers previously published here. The average preprint has a 0.33% match score for this journal, so anything above that is already an above-average fit.
Hof, J. J. P.; Ning, C.; Quinn, L.; Speed, D.
Common complex diseases are clinically heterogeneous, yet most genome-wide association studies (GWAS) assume cases are genetically homogeneous. This challenge is compounded in large-scale biobanks, which increasingly combine cases ascertained under different recruitment strategies, raising concerns that heterogeneous case definitions may dilute genetic signal. To address this, we developed StratGWAS, a scalable framework that leverages clinical features of heterogeneity to construct a transformed phenotype that better reflects genetic liability within diseases. StratGWAS stratifies cases using secondary phenotypic information such as age of onset, medication burden, or recruitment definition. StratGWAS then estimates genetic covariance between strata, and derives a transformed phenotype that upweights cases with higher inferred genetic liability. Through simulation studies (N = 100k) and application to the UK Biobank (N = 368k), we show that StratGWAS consistently outperformed standard GWAS methods. Applied to 21 UK Biobank traits, StratGWAS upweighted individuals with earlier disease onset and higher medication burden, yielding respectively 17% and 4% more independent genome-wide significant loci than standard case-control GWAS. Applied to depression, StratGWAS upweighted individuals with multiple diagnoses, greater psychiatric comorbidity, or higher self-reported depressive symptoms, identifying eight additional independent loci compared to case-control GWAS.
Adewuyi, E. O.; Auta, A.; Okoh, O. S.; Selmer, K.; Gervin, K.; Nyholt, D. R.; Pereira, G.
Observational studies associate type 2 diabetes (T2D) with increased dementia risk; however, the specificity of this relationship to Alzheimer's disease (AD) and its biological underpinnings remain unresolved. We apply an integrative cross-omic framework to dissect genetic links between AD and T2D. Genome-wide analyses reveal a modest positive genetic correlation and robust polygenic sign concordance of AD with T2D. High-resolution analyses demonstrate locus-specific heterogeneity, with coexisting positive and predominantly negative correlations, and strong inverse associations at APOE and HLA. Cross-trait GWAS meta-analyses indicate that most genome-wide significant signals reflect trait-specific effects, with only a limited set of variants supported in both AD and T2D. Colocalisation reveals distinct causal variants at most shared loci. Gene-based analyses highlight convergence at functional genes, including PLEKHA1, VKORC1, ACE, and APOE, without implying concordant variant-level effects. Bidirectional Mendelian randomisation (MR) shows no evidence of a causal relationship between AD and T2D in either direction. Summary-data MR prioritises genes whose expression or methylation affects both AD and T2D, mostly with opposing effects. Only PLEKHA1 (eQTL) and CAMTA2 (mQTL) show concordant positive associations. Five genes, GALNT10, HSD3B7, BCKDK, KAT8, and ACE, are supported across both regulatory layers, while numerous signals cluster within a regulatory hotspot at 16p11.2, supporting convergent transcriptional and epigenetic involvement, despite directional divergence. These results refine the AD-T2D relationship; rather than a simple shared-risk model, overlap reflects locus-specific heterogeneity and cross-omic convergence often showing opposing effects on AD versus T2D risk, consistent with antagonistic pleiotropy.
Huang, X.; Wang, Y.; Zhao, Q.; Gao, Z.
GWAS increasingly reveal shared genetic influences across neurodevelopmental, psychiatric, and neurodegenerative traits. However, cross-trait genetic covariance derived from GWAS summary statistics can be inflated by sample overlap and other structured background effects, obscuring higher-order genetic organization. We extend PathGPS, a recently developed statistical method that estimates an adjusted genetic covariance by subtracting a background covariance learned from weakly associated variants, and then extracts reproducible low-rank structure using rotation and bootstrap aggregation. When applied to 15 phenotypes related to neurodevelopmental and neurodegenerative disorders, the adjusted analysis yields four stable clusters with an interpretable topology. Adjusting for background covariance, which appears to be related to traumatic life experiences, sharpens the cluster boundaries and substantially shifts the clustering result for post-traumatic stress disorder. Simulations with controlled overlap and structured background covariance show that PathGPS has improved factor recovery relative to [...]
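The background-subtraction step described in this abstract can be illustrated with a minimal sketch. It is not the PathGPS implementation: the function names, the z-score threshold used to split "strong" from "weak" variants, and the use of an uncentered second-moment matrix are all assumptions made for illustration.

```python
def cross_trait_moment(rows):
    """Cross-trait second-moment matrix of per-variant effect estimates.

    Effects are assumed to be centered at zero, so this stands in for a covariance.
    """
    n, k = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(k)] for i in range(k)]


def adjusted_covariance(effects, threshold):
    """Subtract a background matrix (weak variants) from the signal matrix (strong variants).

    effects: list of per-variant rows, one z-score per trait (hypothetical input layout).
    threshold: hypothetical |z| cut-off separating strong from weak variants.
    """
    strong = [r for r in effects if max(abs(x) for x in r) >= threshold]
    weak = [r for r in effects if max(abs(x) for x in r) < threshold]
    signal = cross_trait_moment(strong)
    background = cross_trait_moment(weak)
    k = len(signal)
    return [[signal[i][j] - background[i][j] for j in range(k)] for i in range(k)]


# Toy input: two strongly associated variants and two weak ones, two traits.
toy_effects = [[4.0, 4.0], [-4.0, -4.0], [1.0, 1.0], [-1.0, 1.0]]
adjusted = adjusted_covariance(toy_effects, threshold=3.0)
```

In this toy input the weak variants contribute an identity-like background, so the subtraction lowers the diagonal while leaving the cross-trait term untouched.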
Kyosaka, T.; Narita, A.; Kulski, J. K.; Minn, A. K. K.; Miyake, A.; Kotsar, Y.; Hiraide, K.; Ojima, T.; Nakatochi, M.; Namba, S.; Yamaji, T.; Sutoh, Y.; Sasaki, Y.; Broer, L.; Frost, F.; Koyanagi, Y. N.; Kasugai, Y.; Ito, H.; Sawada, N.; Nakano, S.; Suzuki, S.; Hishida, A.; Koyama, T.; Kubo, Y.; Funayama, T.; Makino, S.; Shirota, M.; Takayama, J.; Gocho, C.; Sugimoto, S.; Otsuka-Yamasaki, Y.; Tanno, K.; Abe, Y.; Nakajima, O.; Spaander, M. C. W.; Weiss, S.; Lerch, M. M.; Levy, D.; Hwang, S.-J.; Wood, A. C.; Rich, S. S.; Rotter, J. I.; Taylor, K. D.; Tracy, R. P.; Stocker, H.; Brenner, H.; Leja,
Helicobacter pylori (H. pylori) infects the gastric epithelium of approximately half of the global population, and is a well-known risk factor for developing gastric cancer. Despite the clinical significance of H. pylori infection, many genetic factors that contribute to susceptibility remain unidentified. While it is well-established that H. pylori infection can result in gastritis and peptic ulcers, which may progress to gastric cancer, its causal link to other diseases remains unclear. We performed a genome-wide association study (GWAS) for anti-H. pylori IgG antibody titers, which were validated as a surrogate marker for H. pylori infection by the correlation with clinical traits, followed by gene-based and pathway analyses, involving up to 140,863 individuals. This included 56,967 in the discovery phase, and 68,211 in the replication phase from Japanese cohorts, and an additional 15,685 from European populations in a cross-ancestry meta-analysis. We reveal significant associations between H. pylori infection and polymorphisms in the Human Leukocyte Antigen (HLA) class II region within the Major Histocompatibility Complex (MHC), as well as genes related to innate immunity, including CCDC80, NFKBIZ, TIFA, PSCA, and TRAF3. Mendelian randomization (MR) analysis revealed that genetic liability to H. pylori infection has both positive and negative causal relationships with a variety of diseases, including autoimmune-related diseases such as Type 1 diabetes, Hashimoto's disease, atopic dermatitis, as well as traits like body height and weight. These genetic findings strongly support the notion that genetic liability to H. pylori infection influences not only gastrointestinal diseases, but also a broader spectrum of health issues, thereby providing valuable insights for public health strategies and personalized medicine approaches.
Le Guen, Y.; Pena-Tauber, A.; Catoia Pulgrossi, R.; Park, J.; Orias, H.; Greicius, M. D.
Alzheimer's disease and related dementias (ADRD) [1] and Parkinson's disease and related disorders (PDRD) [2] have substantial genetic contributions, yet the role of rare damaging coding variants remains incompletely characterized at population scale [3-6]. We performed gene-based burden testing of rare loss-of-function and deleterious missense variants using whole-genome sequencing data from large population biobanks combined with disease-specific sequencing cohorts, leveraging proxy phenotypes to maximize statistical power for late-onset neurodegenerative diseases [7]. We confirmed rare variant burden in established ADRD genes (ABCA7, PSEN1, ADAM10, ATP8B4, GRN, SORL1, TREM2, SHARPIN) and PDRD genes (GBA1, LRRK2). We additionally identified novel associations in ADRD (IMPA2, PMM2, SYNE1, CHRNA4, FCGR1A) and PDRD (ANKRD27, CCL7, USP19, SKP1, KANSL3). The strongest signal was observed for ANKRD27, where damaging variants clustered within domains mediating interactions with Rab GTPases and retromer components. Our results demonstrate the power of population-scale sequencing combined with proxy phenotypes to identify rare coding risk genes for neurodegenerative diseases.
Han, G.; Yuan, A.; Oware, K. D.; Wright, F.; Carroll, R. J.; Smith, M.; Ory, M. G.; Yan, D.; Wang, W.; Sun, Z.; Dai, Q.; Allen, C.; Dang, A.; Liu, Y.
Alzheimer's disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model-based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimer's disease as an example, the framework identifies biologically coherent pathways (such as gamma-secretase pathways) previously undetected. The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.
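To make the role of an informative normal prior concrete, the sketch below shows the textbook normal-normal conjugate update for a single effect estimate. It is a generic illustration, not the hybrid estimator derived in the preprint; the function name and the example numbers are hypothetical.

```python
def normal_posterior(beta_hat, se, prior_mean, prior_sd):
    """Textbook conjugate update: normally distributed estimate, normal prior.

    Returns the posterior mean and posterior variance of the effect.
    """
    w_data = 1.0 / se ** 2          # precision of the data-based estimate
    w_prior = 1.0 / prior_sd ** 2   # precision of the prior
    post_var = 1.0 / (w_data + w_prior)
    post_mean = post_var * (w_data * beta_hat + w_prior * prior_mean)
    return post_mean, post_var


# Equal precision for data and prior: the posterior mean sits halfway between them.
informative = normal_posterior(beta_hat=1.0, se=1.0, prior_mean=0.0, prior_sd=1.0)

# A very diffuse prior leaves the estimate essentially unchanged.
diffuse = normal_posterior(beta_hat=1.0, se=1.0, prior_mean=0.0, prior_sd=1e6)
```

The precision-weighted form shows why a confident prior matters most in small samples: as the standard error shrinks, the data term dominates and the posterior mean approaches the frequentist estimate.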
Liu, Y.; Deng, K.; Ye, Y.; Zhan, J.; Wang, Z.; Chen, S.; Hu, X.; Chang, A.; Li, Z.; Jin, X.; Liu, S.; Chen, K.; Shen, H.; Qi, X.; Xu, X.; Zhang, H.
Identifying disease-associated genetic variants remains a key challenge in genomics, especially in small cohorts or for rare and complex mutation types where genome-wide association studies (GWAS) often fall short. We introduce ATLAS, a population-level framework that leverages attention signals from pretrained genomic language models (gLMs) to detect disease-associated genes and loci directly from raw sequences, without requiring explicit variant calls or supervised training. ATLAS first performs gene-level differential attention analysis to prioritize candidate genes, followed by base-level analysis to localize disease-associated regions at single-haplotype resolution. We validate ATLAS on synthetic and β-thalassemia datasets, demonstrating robust performance across diverse allele frequencies (down to 10%), cohort sizes (below 200 individuals per group), and genomic scales. Compared to GWAS, ATLAS achieves higher recall of known loci and captures haplotype-specific signals missed by traditional methods. Cross-model benchmarking further shows that precise localization depends on both model size and pretraining on diverse human genomes. In summary, ATLAS offers a scalable, sequence-native alternative to traditional statistical genetics.
Sigalova, O. M.; Pancikova, A.; De Man, J.; Theunis, K.; Hulselmans, G. J.; Konstantakos, V.; Stuyven, B.; De Brabandere, A.; Geurts, J.; Mikorska, A.; Mukherjee, S.; Abouelasrar Salama, S.; Vandereyken, K.; Davie, K.; Mahieu, L.; Adler, C. H.; Beach, T. G.; Serrano, G. E.; Voet, T.; Demeulemeester, J.; Aerts, S.
Genome-wide association studies (GWAS) have linked more than a hundred non-coding genomic loci to Parkinson's disease (PD) risk. Deciphering their functional impact on gene regulation requires cell type-aware modeling approaches to assess the effects of sequence variation on enhancer function and target gene expression. To address this challenge, we generated a comprehensive matched dataset from 190 human donors (115 controls and 75 PD), comprising long-read whole-genome sequencing alongside single-nucleus multiome atlases (snATAC-seq and snRNA-seq for 3.1 and 1.1 million nuclei respectively) of the anterior cingulate cortex and substantia nigra. By integrating chromatin accessibility quantitative trait loci (caQTL), DNA methylation QTL (meQTL), and allele-specific chromatin accessibility (ASCA), we identified 53,841 high-confidence cis-acting genetic variants that modulate cell type-specific enhancer accessibility in one or both brain regions. We then demonstrate that sequence-to-function models can accurately predict the impact of these variants directly from the genomic sequence. Novel explainability approaches allowed stratifying these variants according to their regulatory function, with the majority disrupting specific transcription factor binding sites in a cell type-specific manner. Integrating these "enhancer variants" (EV) with eQTL mapping and gene locus modeling linked a subset of EVs to their target genes. Finally, we applied these models to prioritize regulatory variants at known PD GWAS loci, bypassing statistical limitations in rare disease-relevant populations like dopaminergic neurons. Altogether, we establish a unique resource and new sequence modeling strategies to interpret functional non-coding variation in the human brain.
Guez, J.; Goodrich, J. K.; Moldovan, M. A.; Chao, K. R.; Kar, P.; Panchal, R.; Wilson, M. W.; Laricchia, K. M.; Rohlicek, G.; Biba, D.; Marten, D.; He, Q.; Darnowsky, P. W.; Grant, R.; Weisburd, B.; Baxter, S. M.; Nadeau, J.; Lu, W.; Jahl, S.; Parsa, S.; Lamane, A.; DiTroia, S.; Fu, J.; Zhao, X.; Alarmani, E.; Tolonen, C.; Novod, S.; Bryant, S.; Stevens, C.; Chapman, S. B.; Cusick, C.; Vittal, C.; Gauthier, L. D.; Goldstein, J. I.; Goldstein, D.; King, D.; gnomAD Project Consortium, ; Tranchero, M.; Lotter, W.; MacArthur, D. G.; Brand, H.; Seplyarskiy, V.; Koch, E.; Talkowski, M. E.; Solomons
Accurate estimates of allele frequencies aid in genetic discovery, including rare disease diagnosis, common disease investigations, and population genetics. Here, we present the Genome Aggregation Database version 4 (gnomAD v4), comprising 807,162 sequenced individuals including 730,947 exomes, a fivefold increase over previous releases, and 76,215 genomes. We demonstrate that statistical power to detect strong selective constraint continues to increase with sample size. We develop a new loss-of-function annotation pipeline, which learns genomic features predictive of nonsense-mediated decay and splicing effects from selection signals, achieving 90% precision for distinguishing likely true versus false positive loss-of-function variants. This improved pipeline, along with incorporation of highly deleterious missense variants into measures of loss-of-function intolerance, improves disease gene detection particularly for short genes and those with gain-of-function mechanisms. To improve disease gene prediction, we systematically extract gene-disease associations from biomedical literature, map these to gene-level biological features, and integrate both with refined constraint metrics within a Bayesian framework, yielding state-of-the-art prediction of gene-disease relevance. Building on this integration, we define a Discovery Potential (DisPo) score that highlights genes under strong constraint but limited clinical characterization. High-DisPo genes are enriched in embryonic lethal and fertility phenotypes, supporting DisPo as a tool to prioritize previously under-characterized disease genes. Together, these advances establish a unified framework for accelerating gene discovery and improving rare disease diagnosis.
Collins, S.; Bah, I.; Pysar, R.; Mowat, D.; Turner, T. N.; Chatterjee, S.
Mowat-Wilson syndrome (MWS) is a rare neurodevelopmental disorder caused by mostly heterozygous loss-of-function variants in ZEB2. Affected individuals show considerable variability in clinical presentation. In particular, Hirschsprung disease (HSCR) occurs in only a subset of patients, suggesting that additional genetic factors may modify disease penetrance. To investigate this possibility, we performed whole-genome sequencing of two parent-child trios in which the probands carried pathogenic de novo ZEB2 variants but differed in enteric phenotype: one individual with MWS and long-segment HSCR and another with MWS without HSCR. In both probands, the ZEB2 variants represent the primary causative genomic diagnosis, and no additional rare coding variants or excess copy-number burden provided a clear alternative explanation for HSCR. Phasing of a previously defined RET enhancer haplotype of 10 single nucleotide polymorphisms (SNPs) revealed inheritance of a high-risk haplotype in the proband with HSCR, whereas the proband without HSCR carried only low-risk haplotypes on both chromosomes. To place these findings in a developmental context, we analysed single-cell transcriptomic data from the developing human fetal gut and neocortex. ZEB2 and RET show overlapping expression in enteric neural crest progenitors and neuroblasts but minimal overlap in the developing neocortex, indicating that reduced RET dosage is likely to have tissue-specific effects in the enteric nervous system. Together, these results support a model in which common regulatory variation at RET modifies HSCR penetrance in the setting of ZEB2 haploinsufficiency. More broadly, our findings illustrate how whole-genome sequencing can reveal regulatory modifiers that contribute to variable expressivity in ostensibly monogenic disorders.
Paylakhi, S.; Geurgas, R.; Yasko, A.; Wedow, R.; Tegtmeyer, M.
Height and most disease risk are known polygenic traits: characteristics governed by multiple genes at different loci instead of a select few. Though we are beginning to understand how genetic variation impacts cell morphology, whether an analogous polygenic architecture operates at the cellular level, where morphology integrates cytoskeletal organization, organelle positioning, and metabolic state, has yet to be systematically tested. Here, we demonstrate that cellular morphology behaves as a polygenic trait by integrating multimodal modeling, perturbation profiling, and population-scale genetic variation. A shared latent-space autoencoder trained on four large-scale perturbation datasets predicts morphology from gene expression and generalizes without retraining to matched RNA-seq and Cell Painting profiles from 100 genetically diverse iPSC donors. The model predicted 17 morphological features (R² > 0.6, permutation FDR q < 0.05), enriched for spatial organelle distribution and cytoskeletal architecture. Predictive performance does not arise from dominant gene-phenotype relationships: individual genes contribute modestly, and marginal gene-morphology correlations are uniformly weak, revealing a distributed regulatory architecture. Despite this polygenicity, CRISPR perturbation data from the JUMP consortium validates specific model-prioritized genes, such as the cytoskeletal regulator TIAM1, membrane trafficking factor RAB31, and mitochondrial-associated membrane transporter ABCC5, as molecular anchors whose disruption produces feature-specific morphological shifts. Transcriptome-wide association analyses identify correlational variant-gene-morphology chains linking cis-regulatory variation through mitochondrial metabolism (PDHX) and iron transport (SLC11A2) to cellular architecture.
These results establish cellular morphology as a polygenic systems phenotype, extending the omnigenic framework to the cellular level and providing a biological basis for interpreting cross-modal prediction in functional genomics.
Höps, W.; Porubsky, D.; Yoo, D.; de Groot, M.; den Ouden, A.; Derks, R.; Hoekzema, K.; Hoischen, A.; Yntema, H. G.; Human Pangenome Reference Consortium (HPRC), ; Caro, P.; De Falco, A.; van Bon, B.; Brunetti-Pierri, N.; Schaaf, C. P.; Eichler, E.; Gilissen, C.
Human chromosome 15q13.3 is a hotspot for recurrent pathogenic copy number variants (CNVs), which remain unresolved at the sequence level. We generated haplotype-resolved assemblies for 10 patient-parent trios and found that both the long ("BP4-BP5") and short ("CHRNA7") forms of 15q13.3 CNVs arise predominantly by non-allelic homologous recombination (NAHR) enabled by inversion polymorphisms. While most BP4-BP5 CNVs are structurally distinct, three breakpoints cluster in a 2 kbp PRDM9-enriched recombination hotspot. CHRNA7 CNVs originate from NAHR between CHRNA7-LCR repeats embedded within locus-spanning inversions and give rise to paired deletion/duplication events. Population analyses of 581 population haplotypes reveal at least 18 distinct structural haplotypes in 15q13.3 and more than 10-fold ancestry-stratification of BP4-BP5 CNV risk, where 68.4% of Europeans but only 5.1% of East Asians are predisposed. Comparison to six ape species indicates that the duplication architecture promoting instability expanded recently and is largely human-specific.
Wright, H. I. W.; Darrous, L.; Ferrat, L.; Chundru, V. K.; Kamoun, A.; Wood, A. R.; Wright, C. F.; Patel, K. A.; Frayling, T. M.; Weedon, M. N.; Beaumont, R. N.; Hawkes, G.
Whole genome sequencing in diverse population-scale biobanks offers new insights into the genetic architecture of complex traits from rare and non-coding variants. However, rare single variant and aggregate associations are often confounded by linkage disequilibrium and haplotype structure, resulting in large numbers of false-positive associations. Previous methods that rely on reference panels or linkage disequilibrium-matrices to determine conditional independence in meta-analyses do not scale to very rare variants, which may be observed in only one biobank and can exhibit long-range haplotypes. Here, we implement a federated approach to perform iterative conditional meta-analysis on individual-level genotype and phenotype data across biobanks while adhering to data sharing policies. We applied our methodology to a meta-analysis of LDL-C in 614,375 individuals from UK Biobank and All of Us, encompassing six genetic ancestry groups. After conditioning, only 4.3% of significantly associated rare single variants and 6.9% of aggregates remained statistically independent. The proportion of significant aggregates that remained independent after conditioning was higher for coding-based tests than non-coding. We further validate that our approach effectively suppresses false-positive associations using simulations centred on the LDLR locus. We identify allelic series of variants associated with reduced LDL-C, including loss-of-function variants in DNAJC13 and variants in the 3-prime untranslated region of LDLR. Our results highlight that federated conditioning can distinguish independent rare variant signals from linkage and haplotype structure artifacts in multi-ancestry meta-analyses across separate biobanks.
Tsitkov, S.; Raju, A.; Wu, J.; Li, J.; Lim, R. G.; Wu, Z.; Al Bistami, N.; Answer ALS Consortium, ; Van Eyk, J.; Svendsen, C.; Rothstein, J. D.; Glass, J. D.; Finkbeiner, S.; A Kaye, J.; Thompson, L. M.; Fraenkel, E.
Amyotrophic lateral sclerosis (ALS) is highly heritable, yet the vast majority of cases lack an identifiable genetic cause and clinical progression remains largely unpredictable. To connect noncoding and rare genetic variation to disease phenotypes in a relevant cell type, we generated a multi-omic quantitative trait locus (QTL) atlas from 594 induced-pluripotent-stem-cell-derived human motor neuron lines (522 ALS patients, 72 controls). By mapping cis-QTLs for chromatin accessibility, splicing and gene expression from whole-genome sequencing, we identify common and rare variants on the wild-type C9orf72 allele that form regulatory haplotypes. These haplotypes influence C9orf72 expression levels in motor neurons and stratify C9-ALS patients into four subgroups; using clinical disease duration data and longitudinal ALSFRS-R scores, we show that these subgroups exhibit different survival trajectories, indicating that wild-type C9orf72 expression acts as a genetic modifier of disease duration. Beyond the C9orf72 locus, we detect ultra-rare intronic variants that create cryptic exons and structural and nonsense variants in established ALS genes, providing likely genetic explanations for disease in additional patients who previously lacked a molecular diagnosis. Our results show that QTL mapping in patient-derived motor neurons can reveal regulatory modifiers of progression and hidden pathogenic events in ALS, providing a framework for genetically informed risk attribution and patient stratification in complex neurological diseases.
Tesi, N.; Salazar, A.; Bouland, G.; Alvarez Sirvent, D.; Zhang, Y.; Knoop, L.; van Schoor, N. M.; Huisman, M.; Wijesekera, S.; Krizova, J.; Tijms, B.; Vijverberg, E.; ADGC, Bonn, CHARGE, EADB, EADI, FinnGen, GERAD, GR@ACE/DEGESCO, PGC-ALZ, ; Hulsman, M.; van der Lee, S. J.; Reinders, M.; Holstege, H.
Genome-wide association studies (GWAS) have identified over 100 Single Nucleotide Polymorphisms (SNPs) associated with Alzheimer's disease (AD) risk; however, most signals tag haplotypes rather than causal variants. This highlights the need to characterize haplotype-specific variation, including structural variants (SVs) and epigenetic modifications, as these may play a central role in shaping downstream disease mechanisms. We applied linkage disequilibrium (LD)-based clumping, followed by conditional analysis to identify significant and independent haplotypes associated with AD. Through long-read sequencing of 493 individuals, we systematically characterized the SV and DNA methylation landscape of these haplotypes. We integrated allele-specific differential methylation and chromatin organization to prioritize SVs likely contributing to disease mechanisms. Finally, we explored the feasibility of imputation approaches to predict SV size in 5,936 array-genotyped individuals. Using AD-GWAS summary statistics for 98 GWAS loci we identified 280 independent and significant haplotypes. We then identified 2,000 unique SVs that were in LD (R² > 0.15) with 207/280 haplotypes. These SVs were predominantly composed of intronic transposable elements and tandem repeats, largely multi-allelic and overlapping regulatory regions. Based on differential methylation, genomic and chromatin co-localization, we prioritized 52 SVs as candidate contributors to disease mechanisms: 14 of these were in high LD with AD-haplotypes (R² > 0.8), 12 were in moderate LD (R² > 0.5), and 26 were in low LD (R² > 0.15). We identified intronic SVs in TMEM106B, CYSTM1, IPMK, LMAN2, MINDY2, as well as likely regulatory and exonic SVs in APP, NDUFS2, TMEM184A, STRN4, CNN2, ADAM10, and other loci. Fine mapping of the PLEC/SHARPIN locus revealed a novel haplotype with a tandem repeat expansion driving enhancer methylation and reduced PLEC expression in microglia.
Finally, we imputed 83% of SVs with high accuracy (N = 1,651, mean R² = 0.76), and association with AD status of imputed SVs yielded 112 significant associations (FDR < 0.05). AD risk loci are genetically complex, often comprising multiple haplotypes and linked SVs that could contribute to disease mechanisms. Integrating long-read sequencing, epigenetic data, and imputation strategies provides a more nuanced view of AD genetic architecture and highlights SVs as potential drivers of disease risk.
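LD-based clumping, which this abstract uses to define independent haplotypes, is commonly implemented as a greedy pass over variants sorted by p-value. The sketch below shows that generic procedure only; the function name, the input layout, the variant identifiers, and both thresholds are assumptions for illustration and are not taken from the preprint. It also omits the physical-distance window that real tools usually apply.

```python
def clump(pvalues, r2, p_threshold=5e-8, r2_threshold=0.1):
    """Greedy LD clumping: keep the most significant variant, drop anything in LD with it.

    pvalues: dict mapping variant id -> association p-value.
    r2: dict mapping frozenset({a, b}) -> squared correlation; missing pairs count as 0.
    Returns index variants ordered from most to least significant.
    """
    candidates = sorted(
        (v for v, p in pvalues.items() if p <= p_threshold),
        key=lambda v: pvalues[v],
    )
    index_variants = []
    for variant in candidates:
        # A variant becomes a new index variant only if it is not in LD
        # with any index variant that was selected before it.
        in_ld = any(
            r2.get(frozenset((variant, lead)), 0.0) >= r2_threshold
            for lead in index_variants
        )
        if not in_ld:
            index_variants.append(variant)
    return index_variants


# Hypothetical toy data: rs2 is tagged by rs1, rs4 is not significant.
toy_pvalues = {"rs1": 1e-12, "rs2": 1e-10, "rs3": 1e-9, "rs4": 0.01}
toy_r2 = {frozenset(("rs1", "rs2")): 0.9}
leads = clump(toy_pvalues, toy_r2)
```

Clumping only removes variants correlated with a stronger signal; the conditional analysis mentioned in the abstract is a separate step that tests whether the remaining signals stay significant after adjusting for one another.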
Elmore, A. R.; Hanson, A. L.; Leyden, G. M.; Johnson, J.; Davey Smith, G.; Paternoster, L.; Gaunt, T. R.; Hemani, G.
Mapping the pleiotropic effect of genetic variation on biological processes and complex phenotypes is fundamental to extracting translational insight from genome-wide association studies (GWAS). Here we present The Human Genotype-Phenotype Map (GPMap), a repository of colocalizing genetic associations across 15,997 complex traits and 2.7 million molecular measurements, leveraging common and rare variants and cis- and trans-acting effects across disaggregated tissue types and single cell datasets to trace the complex pathways through which they act. We identify over 49.3 million colocalizing trait pairs, which aggregate into 97,393 colocalization groups, representing distinct pleiotropic variants based on shared genetic signals, with 55.8% of genome-wide significant disease-associated loci colocalizing with at least one molecular trait. This insight facilitates clustering of complex health and disease phenotypes based on genetic architecture, and the dissection of polygenic traits reflecting the composite impact of many underlying processes. We show that leveraging pleiotropic information can enhance the selection of genetic instruments for causal inference approaches and improve prediction of drug trial success. This open-source resource is available at https://gpmap.opengwas.io, with functionality for user GWAS upload.
Kramer, B. K.; Kushner, S. A.; Rzhetsky, A.
Birth order has been implicated in the etiology of individual diseases, but has never been systematically assessed at phenome-wide scale with large administrative claims data and complementary epidemiological designs. Here we use two complementary approaches: a between-family matched cohort of 1.6 million pairs and a within-family sibling comparison which includes 5.1 million families and 10.3 million individuals. These were both applied to 569 diseases defined by the ICD9-CM/ICD10-CM codes in the commercial claims data of Merative MarketScan. Of 418 diseases with adequate case counts, 150 show Bonferroni-significant birth-order associations. All odds ratios compare second-borns with first-borns, so OR < 1 indicates first-born excess. First-borns are at excess risk for neurodevelopmental conditions (autism OR = 0.74, ADHD OR = 0.93) and immune-allergic diseases consistent with the hygiene hypothesis (food allergy OR = 0.80, allergic rhinitis OR = 0.91), while second-borns are at excess risk for substance abuse (OR = 1.19) and gastrointestinal conditions. Between-family and within-family estimates agree in direction for 84.7% of significant diseases (Pearson r = 0.65), and results are robust to state fixed effects (r = 0.997) and full-sibling restriction. Prespecified validation controls were broadly consistent with expectations. These findings provide a comprehensive map of birth-order effects across the human disease phenome.
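The direction convention stated in this abstract (second-borns in the numerator, so OR < 1 means first-born excess) can be checked with a crude odds ratio from a 2x2 table. This is a minimal sketch: the counts are invented to land on a value of 0.74 and are not data from the study, and the preprint's matched and within-family designs would use adjusted estimators rather than this unadjusted ratio.

```python
def odds_ratio(cases_second, noncases_second, cases_first, noncases_first):
    """Crude odds ratio comparing second-borns (numerator) with first-borns (denominator)."""
    odds_second = cases_second / noncases_second
    odds_first = cases_first / noncases_first
    return odds_second / odds_first


# Hypothetical counts: fewer cases among second-borns gives OR < 1 (first-born excess).
example_or = odds_ratio(cases_second=74, noncases_second=1000,
                        cases_first=100, noncases_first=1000)
first_born_excess = example_or < 1
```

Swapping the two groups inverts the ratio, which is why the abstract fixes the comparison direction before reporting any estimates.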
Boone, P. M.; Erdin, S.; Mohamed, A.; Haghshenas, S.; Faour, K. N. W.; Kao, E.; Fu, J.; Auwerx, C.; Harripaul, R.; Jana, B.; Springer, D.; Hallstrom, G.; de Esch, C. E. F.; Denhoff, E.; Holmes, L.; Mohajeri, K.; Lemanski, J.; Kerkhof, J.; McConkey, H.; Rzasa, J.; McCune, M. J.; Levy, M. A.; Grafstein, J.; Larson, M.; Wright, Z.; Beauchamp, R. L.; Lucente, D.; Abou Jamra, R.; Agrawal, N.; Agrawal, P. B.; Andersen, E. F.; Argilli, E.; Araiza, R.; Ballal, S.; Baxter, M. F.; Bergant, G.; Bertsche, A.; Bhavsar, R.; Bortola, D. R.; Bothe, V.; Brasch-Andersen, C.; Braun, D.; Bruel, A.-L.; Buchanan, C
Cohesin is a fundamental genome-organizing complex that orchestrates three-dimensional chromosome folding and gene expression via DNA loop extrusion. Alterations to genes encoding cohesin subunits and cohesin loaders cause Mendelian disorders, including Cornelia de Lange syndrome (CdLS). By contrast, disruption of factors that remove cohesin from DNA, including WAPL and its binding partners PDS5A and PDS5B, has not yet been associated with human disease. Here, we explored the relevance of these cohesin release factors in Mendelian disease by establishing a rare disease cohort of deeply phenotyped individuals with heterozygous, predicted damaging variants in WAPL (n=27), PDS5A (n=8), and PDS5B (n=8), by modeling WAPL deficiency in human cell lines and mice, and by aggregating rare disease association statistics from consortia studies. We identified a WAPL-related disorder characterized by developmental delay, intellectual disability, and risk of other developmental anomalies including clubfoot. Similarities between individuals with damaging WAPL variants and those with large, recurrent 10q22.3q23.2 (10q) deletions (which encompass WAPL) nominate WAPL as a driver gene within this genomic disorder region. While carriers of PDS5A or PDS5B variants exhibited features of developmental disorders, neither cohort-based statistics nor case phenotyping associated these genes with specific phenotypes. We used CRISPR engineering to generate truncating variants in WAPL, as well as the 7.8 Mb 10q deletion or duplication, in human iPSCs and induced neurons. Transcriptomic analyses identified differentially expressed genes in both models, with highly significant overlap between WAPL haploinsufficiency and 10q deletion signatures. Mice with 50% residual Wapl expression exhibited mild deficits of growth and learning/memory, whereas those with 25% residual Wapl expression displayed birth defects and postnatal lethality, revealing a dosage liability threshold below the level of heterozygosity. In summary, we delineated a novel genetic condition caused by cohesin release factor deficiency, nominated WAPL as a driver gene within a genomic disorder region, and further illuminated dosage sensitivity of human cohesin.
Li, H.; Zhang, H.; Zhu, D.; Zhao, P.; Wei, Z.; Lu, J.; Gong, M.; Zhang, Q.; Zheng, W.; Liu, X.; GUAN, D.; Teng, J.; Lin, Q.; Tang, Y.; Gao, Y.; Zhao, S.; Zhang, Z.; Du, J.; Fang, C.; An, B.; Lin, B.; Zhang, H.; Tian, M.; Tian, J.; Chen, S.; Liu, W.; Wang, Y.; Wang, M.-S.; Ibeagha-Awemu, E. M.; Crooijmans, R.; Derks, M.; Godia, M.; Madsen, O.; Pausch, H.; Leonard, A. S.; Frantz, L.; MacHugh, D. E.; Grady, J. F. O.; Ionita-Laza, I.; Zhao, X.; Guan, L.; Zhou, H.; Marmol-Sanchez, E.; van der Wijst, M.; Lu, X.; Jiang, H.; Yang, Z.; Yang, Q.; Liu, Q.; Xu, C.; Li, M.; Hou, Y.; Pan, Z.; Chen, Y.; Xian
Cattle are integral to global food security, yet the molecular architecture of their complex traits remains poorly understood. Here, we present the Cattle Genotype-Tissue Expression (CattleGTEx) Phase 1 resource (https://cattlegtex.farmgtex.org/), a substantial expansion of the pilot study. By leveraging 12,422 RNA-seq profiles across 43 tissues and 82 breeds, we characterized 433,972 primary and 161,428 non-primary regulatory effects spanning seven molecular phenotypes. This high-resolution atlas resolves 75% of GWAS signals for 44 complex traits, significantly addressing the "missing regulation" in livestock. We propose a genetic regulatory model demonstrating how variants across multiple biological layers interact with specific biological contexts to shape phenotypic variation. Furthermore, CattleGTEx elucidates mechanisms underlying adaptive evolution between Bos taurus and Bos indicus, as well as artificial selection in dairy and beef breeds. Finally, by mapping evolutionary constraints on these regulatory effects, we demonstrate the translational value of this resource for prioritizing causal variants in human complex diseases. Together, Phase 1 of CattleGTEx provides a transformative framework for functional genomics, precision breeding, and comparative genetics.
Lawrence, J. M.; Breunig, S.; Schaffer, L. S.; Sheppard, A.; Zorina-Lichtenwalter, K.; Grotzinger, A. D.
Major depression (MD) is a disorder class that exhibits substantial phenotypic and clinical heterogeneity, yet many large-scale molecular genetic investigations treat MD as a unitary outcome. Here, we applied Genomic Structural Equation Modeling (Genomic SEM) to characterize the genetic variation in two clinically relevant MD subtypes, childhood-onset (child-onset) and treatment-resistant MD, that is independent of the field-standard GWAS of MD in all its forms. In addition, we fit a complementary "boosting" model that leveraged shared signal across the subtype and general MD GWAS to increase power for subtype biological discovery. At the genome-wide level, more than half of the common-variant liability for child-onset and treatment-resistant MD was unique relative to the general MD GWAS, indicating substantial subtype-specific genetic architecture. Unique components of both subtypes showed robust associations with genetic liability for schizophrenia and bipolar disorder, and the child-onset-specific component exhibited genome-wide overlap with early developmental outcomes, including autism spectrum disorder and childhood intelligence. Transcriptome-wide analyses implicated upregulation of SMIM19 in liability specific to child-onset MD, while stratified functional enrichment highlighted gene sets involved in limbic and frontal brain systems for the boosted child-onset component. Together, these findings demonstrate that MD contains biologically distinct subtypes that exhibit etiological divergences more akin to separate disorders than to subtypes within a concrete diagnostic framework. We find that stratifying MD by biologically distinguishable subtypes may be crucial for enhancing biological discovery and elucidating etiological pathways in molecular genetic studies of depression.